Introduction to Hypothesis Testing

Esteban Montenegro-Montenegro, PhD

Psychology and Child Development

Today’s aims

  • We will study the concept of the p-value, also known as a significance test.

  • I will introduce the concept of hypothesis testing.

  • I will also introduce the mean comparison for independent groups.

What is a p-value?

  • p-value stands for probability value.

  • It was born as a measure to reject a hypothesis.

  • In statistics and science, we always have hypotheses in mind; statistics translates them into testable statements.

  • For example, we often ask ourselves questions such as: why won’t my boyfriend express his feelings? Then we ask: is this related to gender? Is it true that women share their emotions more easily than men do? If so, does it happen only to me? Is it a coincidence?

  • We can create a hypothesis from these questions; let’s try to write one:

\(H_{1}\) = There is a difference in emotional expression between cisgender women and cisgender men.

  • This is our alternative hypothesis, but we also need a null hypothesis:

\(H_{0}\) = There is no difference in emotional expression between cisgender women and cisgender men.

  • That was easy, you might think, but why do we need a null statement?

    • Science always starts with a null belief, and what we do as scientists is collect evidence that might help us reject the null hypothesis. If you only collect data to support your alternative hypothesis, you would be doing something called “confirmation bias”.
    • Confirmation bias consists of collecting only the information that supports your alternative hypothesis.
    • For example, you keep collecting observations of white swans to prove that all swans are white, instead of searching for the one observation that could refute the claim: a swan that is not white.

What is a p-value? II

  • We can write our null hypothesis as a statistical statement:

\(H_{0}\) = The mean difference in emotional expression between cisgender women and cisgender men is equal to zero.

  • The previous hypothesis is more specific: we know we are focusing on the mean difference.

  • The very first step to test our null hypothesis is to create a null model.

  • A null model is a model in which there is no difference between groups, or no relationship between variables.

What is a p-value? III

  • In this example I will build a null model for the correlation between rumination and depression.

  • To create the null model we will resample and shuffle our observed data. This is similar to having two decks of cards that you shuffle multiple times, until it is hard to guess which card will come next; imagine every card has an equal probability of being selected.

  • This procedure is called a permutation, and it will help us create a distribution of null correlations. In other words, all the correlations produced by the null model are produced by chance alone.
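The shuffling idea can be sketched with a toy example (the numbers here are made up for illustration, not the rumination data):

```r
## A toy sketch of one permutation: shuffling one variable breaks its
## association with the other (values are made up for illustration)
set.seed(42)

x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 5, 7, 9)

cor(x, y)               ## observed correlation: exactly 1, x and y move together

y_shuffled <- sample(y) ## one permutation of y

cor(x, y_shuffled)      ## typically far from 1: the pairing changed, not the values
```

Repeating the shuffle many times and saving each correlation gives a whole distribution of correlations produced by chance.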

What is a p-value? IV

  • Let’s look at the following example. Remember that the estimated correlation between rumination and depression is \(r = 0.58\). The null model will help us decide whether this correlation can be explained by chance.
rum <- read.csv("ruminationComplete.csv", na.strings = "99") ## Imports the data into R

library(dplyr) ## Needed for %>%, mutate(), and across()

rum_scores <- rum %>% mutate(rumination = rowSums(across(CRSQ1:CRSQ13)),
                             depression = rowSums(across(CDI1:CDI26))) ## I'm calculating
                                                                      ## total scores


corr <- cor(rum_scores$rumination, rum_scores$depression,
            use = "pairwise.complete.obs") ## Correlation between rumination and depression

### Let's create a distribution of null correlations

nsim <- 100000

cor.c <- vector(mode = "numeric", length = nsim)

for(i in 1:nsim){
rumia <- sample(rum_scores$rumination) ## Shuffle rumination scores across participants,
                                       ## breaking any real link with depression

cor.c[i] <- cor(rum_scores$depression, rumia, use = "pairwise.complete.obs")
}


hist(cor.c, breaks = 120, 
     xlim= c(min(cor.c), 0.70),
     main = "Histogram of null correlations")
abline(v = corr, col = "darkblue", lwd = 2, lty = 1)
abline(v = c(quantile(cor.c, .025),quantile(cor.c, .975) ),
 col= "red",
 lty = 2,
 lwd = 2)

What is a p-value? V

Let’s estimate the probability of seeing \(r = 0.58\) according to our null model.

pVal <- 2*mean(abs(cor.c) >= abs(corr)) ## Two-sided: correlations at least as extreme
pVal
[1] 0
  • The probability is a number close to \(0.00\).

  • We can now conclude that a correlation as extreme as \(r = 0.58\) is very unlikely to be explained by chance alone.

Note:

The (admittedly ugly) rule of thumb is to consider a p-value <= .05 as evidence that the result is unlikely under the null model.

Mean difference

  • We can also do the same for the difference between means.

  • In this example, my null model is a model with no difference between groups.

  • I also conducted permutations on this example.

  • We will estimate whether the difference in rumination by sex is explained by chance.

  • In this sample the mean difference in rumination between males and females is \(\Delta M\) = 2.74.

set.seed(1236)

ob.diff <- mean(rum_scores$rumination[rum_scores$sex==0], na.rm = TRUE )- mean(rum_scores$rumination[rum_scores$sex==1], na.rm = TRUE)

### let's create a distribution of null differences

nsim <- 100000

diff.c <- vector(mode = "numeric", length = nsim)

### This is something called a "loop"; you don't have to pay attention to the details.

for(i in 1:nsim){
shuffled <- sample(rum_scores$rumination) ## Shuffle rumination scores across
                                          ## participants, breaking any real
                                          ## link with sex

women <- shuffled[rum_scores$sex == 0]

men <- shuffled[rum_scores$sex == 1]

diff.c[i] <- mean(women, na.rm = TRUE) - mean(men, na.rm = TRUE)
} 


hist(diff.c, breaks = 120, 
     main = "Histogram of Null Differences")
abline(v = ob.diff, col = "darkblue", lwd = 2, lty = 1)
abline(v = c(quantile(diff.c, .025), quantile(diff.c, .975)),
 col = "red",
 lty = 2,
 lwd = 2)

  • We can now estimate the probability of seeing a difference as extreme as 2.74:
pVal <- 2*mean(abs(diff.c) >= abs(ob.diff)) ## Two-sided p-value
pVal

Mean difference II

  • In real life, we don’t have to estimate a null model “by hand” as I did before.

  • R and JAMOVI will help us with that, because the null model is already programmed.

  • In addition, when we compare independent means we don’t usually run permutations; instead, we rely on something called the \(t\)-distribution. Let’s study this distribution in more detail.

Mean difference III

  • The \(t\)-distribution made it possible to develop the test known as Student’s \(t\)-test.

  • In this test we use the \(t\)-distribution as our model to calculate the probability of observing a value as extreme as 2.74.

  • This probability is estimated using the cumulative distribution function (CDF) of the \(t\)-distribution.

\(t\)-Test

  • The Student’s test is also known as the “t-test”.

  • In this test, we will transform the mean difference of both groups into a \(t\) value.

\[\begin{equation} t= \frac{\bar{X}_{1} - \bar{X}_{2}}{\sqrt{\Big [\frac{(n_{1}-1)s^{2}_{1}+(n_{2}-1)s^{2}_{2}}{n_{1} + n_{2}-2}\Big ]\Big [\frac{n_{1}+n_{2}}{n_{1}n_{2}} \Big ]}} \end{equation}\]
  • In this transformation, \(n_{1}\) is the sample size of group 1, \(n_{2}\) is the sample size of group 2, \(s^{2}_{1}\) and \(s^{2}_{2}\) are the sample variances, and \(\bar{X}_{1}\) and \(\bar{X}_{2}\) are the sample means.

  • This formula transforms the observed difference in means into a value that comes from the \(t\)-distribution.
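To make the formula concrete, here is a quick check with two small hypothetical groups (the scores are made up, not the rumination data): we compute \(t\) by hand and compare it against R’s t.test().

```r
## Hand-computing the pooled t statistic from the formula above,
## using two small hypothetical groups (not the rumination data)
g1 <- c(5, 7, 8, 6, 9, 7)
g2 <- c(4, 5, 6, 5, 7, 4)

n1 <- length(g1)
n2 <- length(g2)

## Pooled variance: [(n1-1)*s1^2 + (n2-1)*s2^2] / (n1 + n2 - 2)
s2_pooled <- ((n1 - 1)*var(g1) + (n2 - 1)*var(g2)) / (n1 + n2 - 2)

## t = (mean difference) / sqrt(pooled variance * (n1+n2)/(n1*n2))
t_hand <- (mean(g1) - mean(g2)) / sqrt(s2_pooled * (n1 + n2)/(n1*n2))

t_r <- unname(t.test(g1, g2, var.equal = TRUE)$statistic)

all.equal(t_hand, t_r) ## TRUE: the formula matches R's computation
```

The denominator is just the pooled variance multiplied by \((n_{1}+n_{2})/(n_{1}n_{2})\), which is the same as \(1/n_{1} + 1/n_{2}\).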

\(t\)-Test III

  • Remember that we talked about the \(t\)-distribution’s CDF. The CDF will help us estimate the probability of seeing a given value; its y-axis represents probability values.
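As a sketch of how the CDF is used: R’s pt() returns the CDF of the \(t\)-distribution, so a two-sided \(p\)-value is twice the upper-tail area beyond the observed \(t\). The numbers here anticipate the rumination example shown later (t = 2.2457, df = 203).

```r
## Two-sided p-value from the t-distribution's CDF:
## twice the probability of a value beyond the observed t
t_obs <- 2.2457 ## observed t from the rumination example later in the slides
df_t  <- 203    ## its degrees of freedom

p_two_sided <- 2 * pt(t_obs, df = df_t, lower.tail = FALSE)
round(p_two_sided, 4) ## about 0.026, matching the t.test() output later
```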

\(t\)-Test IV

  • We can see how useful a \(t\)-test is by presenting an applied example.

  • In this example we will try to reject the null hypothesis that says:

“The mean rumination score in males is equal to the mean rumination score in females”

  • We represent this hypothesis in statistical notation like this:
\[\begin{equation} H_{0}: \mu_{1} = \mu_{0} \end{equation}\]
  • Also in this example I’m introducing a new function in R named t.test(). This function will help us know whether we can reject the null hypothesis.

  • The function t.test() requires a formula created by using tilde ~.

  • In R, the variable on the right side of ~ is the independent variable, and the variable on the left side of ~ is the dependent variable.

  • In an independent-samples \(t\)-test, the independent variable is always the grouping variable, and the dependent variable is always a continuous variable.

t.test(rumination ~ sex, data = rum_scores, var.equal = TRUE)

    Two Sample t-test

data:  rumination by sex
t = 2.2457, df = 203, p-value = 0.0258
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 0.3347849 5.1535481
sample estimates:
mean in group 0 mean in group 1 
       31.20896        28.46479 
  • In this example, we found that the \(p\)-value is 0.03 and the \(t\)-value is 2.25. This means:

“If we repeat the same analysis with multiple samples, the probability of finding a \(t\)-value as extreme as 2.25 is p = 0.03, under the assumptions of the null model.”

  • This is a very small probability. What do you think? Is 2.25 a value explainable by chance alone?

\(t\)-test Assumptions

  • We haven’t talked about the assumptions of the t-test model.

  • Remember that all models have assumptions; we have to assume something in order to study Nature.

  • The \(t\)-test model assumes that the difference in means is generated by a normally distributed process.

  • It also assumes that the variances are equal in both groups.

  • Let’s see what happens when we assume equal variances but the data does not come from a process with equal variances:

set.seed(1234)

N1 <- 50 ## Sample size group 1

N2 <- 50 ### sample size group 2

Mean1 <- 100 ## Mean group 1

Mean2 <- 20 ### Mean group 2

results <- list()


for(i in 1:10000){
group1 <- rnorm(N1, mean = Mean1, sd = 100) ## standard deviations (and variances) differ

group2 <-  rnorm(N2, mean = Mean2, sd = 200) ## standard deviations (and variances) differ

dataSim <- data.frame(genValues = c(group1,group2), 
                      groupVar = c(rep(1,N1),rep(0,N2)))

results[[i]] <- t.test(genValues ~ groupVar, data = dataSim, var.equal = TRUE)$p.value
}

### Proportion of times we rejected the null hypothesis

cat("Proportion of times we rejected the null hypothesis",sum(unlist(results) <= .05)/length(results)*100)
Proportion of times we rejected the null hypothesis 70.5
  • We successfully rejected the null hypothesis in only 70.5% of the generated data sets, even though the population means truly differ, so ideally the \(t\)-test should reject the null hypothesis every time.

  • Now let’s check what happens when we assume equal variances in our \(t\)-test and the data-generating process actually has equal variances:

set.seed(1234)

N1 <- 50

N2 <- 50

Mean1 <- 100

Mean2 <- 20

results_2 <- list()

for(i in 1:10000){
  
group1 <- rnorm(N1, mean = Mean1, sd = 5) ## equal standard deviations (and variances)

group2 <-  rnorm(N2, mean = Mean2, sd = 5) ## equal standard deviations (and variances)

dataSim <- data.frame(genValues = c(group1,group2), 
                      groupVar = c(rep(1,N1),rep(0,N2)))

results_2[[i]] <- t.test(genValues ~ groupVar, data = dataSim, var.equal = TRUE)$p.value
}

### Probability of rejecting the null hypothesis

cat("Proportion of times we rejected the null hypothesis", sum(unlist(results_2) <= .05)/length(results_2)*100)
Proportion of times we rejected the null hypothesis 100
  • This time we reject the null hypothesis 100% of the time. This is what we were looking for! Remember that we generated data from a process where group 1 had a mean of 100 and group 2 had a mean of 20. The \(t\)-test should reject the null hypothesis every time I generate a new data set, but that did not happen when I made the wrong assumption: assuming equal variances when I should not have.

  • Summary: when we wrongly assume that the variances are equal between groups, we decrease the probability of rejecting the null hypothesis when it should be rejected. This is bad!

  • These simulations showed the relevance of respecting the assumptions of the \(t\)-test.

How do we know if my observed data holds the assumption?

  • There are tests to evaluate the assumption of equality of variances between groups.

  • The most widely used is Levene’s Test for Homogeneity of Variance.

  • We can implement this test in R using the function leveneTest(), which comes with the R package car. You might need to install this package if you don’t have it on your computer; you can run this line of code to install it: install.packages("car")

  • I’m going to test if the variance of rumination holds the assumption of equality of variance by sex:

library(car)

leveneTest(rumination ~ as.factor(sex), 
           data = rum_scores) 
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   1  0.2541 0.6148
      203               
  • In this test the null hypothesis is “The variances of group 1 and group 2 are equal”; if the \(p\)-value is less than or equal to 0.05, we reject the null hypothesis. In the output above you can see the \(p\)-value under the column Pr(>F).

  • In the case of rumination, the \(p\)-value = 0.61. Given that the \(p\)-value is higher than 0.05, we don’t reject the null hypothesis: we can assume the variances of rumination by sex are equal.

How do we know if my observed data holds the assumption? II

  • What happens if the Levene’s Test rejects the null hypothesis of homogeneity of variance?

  • Can we continue using the \(t\) - test to evaluate my hypothesis?

  • The answer is: yes, you can still do a \(t\)-test, but with a correction to the degrees of freedom. We will talk more about degrees of freedom in the next sessions.

  • If you cannot assume equality of variances, all you have to do in R is to switch the argument var.equal = TRUE to var.equal = FALSE.

t.test(rumination ~ sex, data = rum_scores, var.equal = FALSE)

    Welch Two Sample t-test

data:  rumination by sex
t = 2.2841, df = 149.72, p-value = 0.02377
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 0.3702483 5.1180847
sample estimates:
mean in group 0 mean in group 1 
       31.20896        28.46479 
  • Now the output says we are performing a Welch Two Sample t-test; Welch was the statistician who derived the correction.

Effect size

  • Up to this point we have studied how we test our hypothesis when we compare independent means, but we still have to answer the question, how large is a large difference between means? Or, how small is a small difference? In fact, what is considered a small difference?

  • These questions were answered by Jacob Cohen (1923-1998).

  • Cohen created a standardized measure to quantify the magnitude of the difference between means.

  • Let’s see a pretty plot about it in this link.
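As a sketch of Cohen’s standardized measure (Cohen’s d), it divides the mean difference by the pooled standard deviation; the scores below are made up for illustration, not the rumination data:

```r
## A minimal sketch of Cohen's d for two independent groups
## (the scores are hypothetical, not the rumination data)
cohens_d <- function(x, y){
  nx <- length(x)
  ny <- length(y)
  ## Pooled standard deviation, as in the t-test formula
  s_pooled <- sqrt(((nx - 1)*var(x) + (ny - 1)*var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / s_pooled
}

g1 <- c(31, 29, 34, 30, 33)
g2 <- c(27, 26, 30, 28, 25)

cohens_d(g1, g2) ## 2.1: the means differ by 2.1 pooled standard deviations
```

Because d is expressed in standard-deviation units, it can be compared across studies that use different scales.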

References